Language modeling using x-grams
Authors
Abstract
In this paper, an extension of n-grams is proposed in which the memory of the model (n) is not fixed a priori. Instead, large memories are accepted first, and merging criteria are then applied to reduce complexity and to ensure reliable estimations. The results show that the perplexity obtained with x-grams is smaller than that of n-grams. Furthermore, the complexity is smaller than that of trigrams and can approach that of bigrams.
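The variable-memory idea can be illustrated with a short sketch. The merging rule below (dropping any long context whose next-word distribution is close, by KL divergence, to that of its shorter suffix) is an illustrative stand-in for the paper's merging criteria, and `max_n` and `threshold` are assumed parameters, not values from the paper.

```python
from collections import Counter, defaultdict
import math

def train_xgram(corpus, max_n=4, threshold=0.05):
    """Count contexts of length 0..max_n-1, then merge: drop any context
    whose next-word distribution adds little over its shorter suffix.
    The KL-divergence threshold is an illustrative criterion only."""
    counts = defaultdict(Counter)
    for sent in corpus:
        toks = ["<s>"] * (max_n - 1) + sent + ["</s>"]
        for i in range(max_n - 1, len(toks)):
            for k in range(max_n):            # context lengths 0..max_n-1
                counts[tuple(toks[i - k:i])][toks[i]] += 1

    def dist(ctx):
        c = counts[ctx]
        tot = sum(c.values())
        return {w: n / tot for w, n in c.items()}

    kept = {}
    for ctx in sorted(counts, key=len, reverse=True):
        if len(ctx) == 0:                     # always keep the unigram model
            kept[ctx] = dist(ctx)
            continue
        p, q = dist(ctx), dist(ctx[1:])       # long context vs. its suffix
        kl = sum(pw * math.log(pw / q.get(w, 1e-12)) for w, pw in p.items())
        if kl > threshold:                    # keep only informative contexts
            kept[ctx] = p
    return kept

def prob(model, context, word, max_n=4):
    """Back off to the longest retained suffix of the context."""
    ctx = tuple(context[-(max_n - 1):])
    while ctx not in model:
        ctx = ctx[1:]                         # empty context is always kept
    return model[ctx].get(word, 0.0)
```

In this toy version, a merged (dropped) context simply falls back to its suffix at lookup time, which is how the model ends up with a different effective memory for each case.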
Similar papers
N-Gram Language Modeling for Robust Multi-Lingual Document Classification
Statistical n-gram language modeling is used in many domains, such as speech recognition, language identification, machine translation, character recognition and topic classification. Most language modeling approaches work on n-grams of terms. This paper reports on ongoing research in the MEMPHIS project, which employs models based on character-level n-grams instead of term n-grams. The models ar...
Full text
Using x-gram for efficient speech recognition
X-grams are a generalization of n-grams in which the number of previous conditioning words is different for each case and is decided from the training data. X-grams reduce perplexity with respect to trigrams and require fewer parameters. In this paper, the representation of x-grams using finite state automata is considered. This representation leads to a new model, the non-deterministi...
Full text
Interpolated Dirichlet Class Language Model for Speech Recognition Incorporating Long-distance N-grams
We propose a language modeling (LM) approach that incorporates interpolated distanced n-grams into a Dirichlet class language model (DCLM) (Chien and Chueh, 2011) for speech recognition. The DCLM relaxes the bag-of-words assumption and the document-level topic extraction of latent Dirichlet allocation (LDA). The latent variable of the DCLM reflects the class information of an n-gram event rather than the topic in...
Full text
Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures
Sources of training data suitable for language modeling of conversational speech are limited. In this paper, we show how training data can be supplemented with web text filtered to match the style and/or topic of the target recognition task, and that bigger performance gains can be obtained from this data by using class-dependent interpolation of N-grams.
Full text
Vers une modélisation statistique multi-niveau du langage, application aux langues peu dotées. (Toward a multi-level statistical language modeling for under-resourced languages)
This PhD thesis focuses on the problems encountered when developing automatic speech recognition for under-resourced languages whose writing systems have no explicit separation between words. The specificity of the languages covered in our work requires automatic segmentation of the text corpus into words in order to make n-gram language modeling applicable. While the lack of text data has an i...
Full text